Skip to content

Conversation

@tobixdev
Copy link
Contributor

@tobixdev tobixdev commented Nov 8, 2025

Which issue does this PR close?

This is a draft for #18223 . The APIs are not to be considered final (e.g., options are missing in the pretty printer).
The primary purpose is to spark discussion for now.

So happy to hear inputs!

Rationale for this change

How cool would it be to just state that you should properly format my byte-encoded uuids? :)

What changes are included in this PR?

  • Defines the LogicalType trait for some canonical extension types from arrow.
  • Defines UnresolvedExtensionType, a "DataFusion canonical extension type" that can be used to create a LogicalType instance even without a registry. For example, the new_... functions for DFSchema could make use of this type, as they currently have no access to a registry. Furthermore, these function could directly instantiate the canonical arrow extension types as they are known to the system. Then the functions could resolve native and canonical extension types themselves without an access to the registry and then "delay" the resolving of the custom extension types. The idea is that there is then a "Type Resolver Pass" that has access to a registry and replaces all instances of this type with the actual one. While I hope that this is only a temporary solution until all places have access to a logical type registry, I think this has the potential to become a "permanent temporary solution". With this in mind, we could also consider making this explicit with an enum and not hide it behind dynamic dispatch.
  • Defines an incomplete ValuePrettyPrinter for showcasing the UUID pretty printing.
  • Plumbing for having ExtensionTypeRegistry in SessionState

What is also important is what is not included: an integrative example of making use of the pretty printer. I tried several avenues but, as you can imagine, each change to the core data structure is a huge plumbing effort (hopefully reduced by the existence of UnresolvedLogicalType).

I really like the suggestion by @paleolimbot to use pretty-printing record batches as the first use case. You can see a mini example in the test that pretty-prints UUIDs. The nice thing is that this probably would not require much plumbing as the [DataFrame] already has access to the [SessionState]. The only thing that's missing for me to actually include this example here is that arrow-rs does not currently support passing custom pretty printers in pretty_format_batches_with_options.

Imagine that the to_string function in the DataFrame does the following:

  1. Look up any extension type information from the schema (in a future world this would already be part of the schema and another lookup is not necessary)
  2. Gather the pretty printers
  3. Pass in pretty printer to arrow-rs for formatting.

If you think this is a worthwhile pursuit we could add the capability to arrow-rs.

Are these changes tested?

Not really, as there is not integrative example yet.

Are there any user-facing changes?

There would be.

@tobixdev tobixdev changed the title [DISCUSSION] Extension Type Registry [DISCUSSION] Extension Type Registry Draft Nov 8, 2025
@github-actions github-actions bot added logical-expr Logical plan and expressions core Core DataFusion crate common Related to common crate functions Changes to functions implementation labels Nov 8, 2025
@tobixdev tobixdev changed the title [DISCUSSION] Extension Type Registry Draft [DRAFT] Extension Type Registry Draft Nov 9, 2025
@tobixdev tobixdev marked this pull request as draft November 9, 2025 08:09
Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

I think this is a great demonstration of the components we need: something that represents a "type" in the way that we actually mean, something to instantiate it from name/metadata/storage type, and some concrete problem that can be solved by combining those.

It might be a nice addition to this demonstration the code required to integrate this to Expr such that Display for an Expr::Literal or the CLI printer can make use of this (maybe an optimizer rule could hydrate any unresolved logical types?). I like the LogicalType as an endpoint but it is a valid concern that it might disrupt the existing logical Expr and such a demonstration might help alleviate that concern.

The ExprField idea I had was basically to start with struct ExprField { inner: FieldRef } (breaking change) and slowly migrate to struct ExprField { logical_type: LogicalType, nullability: bool, ... <other stuff?>} in a non-breaking way. It's very possible that going straight to the LogicalType is just as easy to do (I haven't tried it!).

Comment on lines +70 to +75
/// Represents the canonical [Opaque extension type](https://arrow.apache.org/docs/format/CanonicalExtensions.html#opaque).
///
/// In the context of DataFusion, a common use case of the opaque type is when an extension type
/// is unknown to DataFusion. Contrary to [UnresolvedExtensionType], the extension type has
/// already been checked against the extension type registry and was not found.
impl LogicalType for Opaque {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure mixing the canonical extension "Opaque" and Opaque here is a good idea, but I do think this logical type is a good one to have...it is a place where we can define what happens to a field with ARROW:extension:name in the absence of any other guidance (e.g., here we would print values identically to the storage type...we could define equality here using byte-for-byte equality of metadata if we want to be strict, or using storage type equality if we want to be lenient).

Comment on lines +177 to +179
fn pretty_print_scalar(&self, value: &ScalarValue) -> Result<String> {
Ok(format!("datafusion.unresolved({value})"))
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this error? Like, if this code is called, it meant a programming error led to the logical type not being resolved?

Or maybe this should be unified with the Opaque logical type for now (in other words, we'll always construct Opaque extensions which have acceptable default behaviour, and work towards ensuring they are resolved everywhere).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate core Core DataFusion crate functions Changes to functions implementation logical-expr Logical plan and expressions

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants